1 Spatial operations

1.1 This week

Understanding spatial properties, relationships and how they are used within spatial operations are the building blocks to spatial data processing and analysis. This tutorial takes you through a simple approach to measuring greenspace access for schools in London, using geometric operations as the main methods for processing and analysing your data. You will construct a buffer dataset around our greenspace and determine whether nearby schools intersect with this buffer. We will first visualise our data as points to see if we can identify areas of high versus low access - and then aggregate the data to the ward level for potential further use within analysis with statistical data, such as census information.

1.2 Case study

Recent research (Bijnens et al. 2020) has shown that children brought up in proximity to greenspace have a higher IQ and fewer behavioral problems, irrespective of socio-economic background. In our analysis today, we will look to understand whether there are geographical patterns to schools that have high versus low access of greenspace and where a lack of greenspace needs to be addressed in London. Below, we can see where schools are located in London and get a general understanding of their proximity to large greenspace just through a simple navigation of the map. In this practical we will try to quantify these visual patterns we may observe and find out which schools are within 400 metres of greenspace that is larger than a football pitch.

1.3 Getting started

To enable the efficient, repeatable and reproducible functionality of our work, we will use R-Studio’s ability to create and host code as a script. Before we do anything therefore, we will need to create a new R script: File > New File > R Script

Let’s go ahead and save our script now, so we know it’s stored in our system - and in the future, we only need to remind ourselves to complete a quick save (e.g. cmd / ctrl + s).

Now we have our script created, we can create a new data folder within the same directory as where we saved our script then copy over our data into this folder. Call this folder data. Next, if you haven’t already, download this week’s practical data zipfile. Once downloaded, we want to move this zipfile from your Downloads and into this newly created data folder. You can do this either in your computer OS’s normal file management tool, e.g. finder in Mac OS, or you can do this using R-Studio within the Files pane. Lastly, we want to unzip the file to gain access to the data within our R environment.

File download

File Type Link
London school and greenspace data shp Download

1.4 Libraries

The first two tasks you will do everytime you create a new script is to first, point your computer to your working directory (so it knows where all your data is) and second, (pre-emptively) load many of the libraries you think you’ll be using in your analysis. Luckily, we can now use the latter to also do the former - saving us a little bit of time in our set-up and a lot of time in our script-writing later on - by using the here library. Install this library if you have not already done so!

# libraries
library(here)

But this library alone will not be enough for our analysis today! At the moment, we know that we’ll need to load some spatial data - therefore we need to load a library capable of handling our datsets. For this module, and preferably within your R programming moving forward, we will focus on using the sf or Simple Features library that allows us to load, manipulate, process and export spatial data.

Note
Prior to the sf library, which was introduced to R-Studio in 2017, the sp library was the main spatial library used in R-Studio. As a result, you may see older code or scripts using the sp to handle spatial data. sf has replaced sp as the default spatial library as it works better with the tidyverse way of doing things. Ultimately, you can convert between the two library formats (and some other libraries we will use later on in the term still only work with sp) - but it is best practice to try to use the sf library in your code moving forward.

In addition to the sf library, we want to add in the tidyverse library that will allow us to use the pipe function (%>%), amongst other things, within our work and enable more efficient programming. We are likely going to need some additional libraries to help further manipulate or visualise our datasets as we move forward with our processing - we’ll add these in now, but explain them in a little more detail as we get to use them in our code. These libraries include: units and tmap.

As a result, the top of our script should look something like:

# libraries
library(here)
library(tidyverse)
library(sf)
library(tmap)
library(units)

1.5 Loading our datasets

For this analysis we have three different datasets available: schools in London, greenspace in London (split into two separate datasets), and wards (an administrative geography) in London. All three of our datasets are provided as shapefiles which will make working with the data relatively straight-forward (e.g. even for our point data, the schools, we do not need to convert them from a csv as we often find with this type of data). But we’ll need to do quite a few steps of processing to get our final dataset.

Let’s go ahead and load our three variables - we will use the sf library st_read() command to load our datasets into variables for use within our code:

# load london schools
london_schools <- st_read(here::here('data/schools', 'school_data_london_2016.shp'))
## Reading layer `school_data_london_2016' from data source 
##   `/Users/Tycho/Dropbox/UCL/Web/jtvandijk.github.io/GEOG0114/data/schools/school_data_london_2016.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 3889 features and 44 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -63477.1 ymin: 6665527 xmax: 40792.11 ymax: 6751279
## Projected CRS: WGS 84 / Pseudo-Mercator
# load london wards
london_wards <- st_read(here::here('data/administrative_boundaries', 'london_wards.shp'))
## Reading layer `london_wards' from data source 
##   `/Users/Tycho/Dropbox/UCL/Web/jtvandijk.github.io/GEOG0114/data/administrative_boundaries/london_wards.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 633 features and 6 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: 503568.2 ymin: 155850.8 xmax: 561957.5 ymax: 200933.9
## Projected CRS: OSGB 1936 / British National Grid
# load london greenspace
TL_greenspace <- st_read(here::here('data/greenspace', 'TL_GreenspaceSite.shp'))
## Reading layer `TL_GreenspaceSite' from data source 
##   `/Users/Tycho/Dropbox/UCL/Web/jtvandijk.github.io/GEOG0114/data/greenspace/TL_GreenspaceSite.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 8614 features and 6 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XYZ
## Bounding box:  xmin: 499139.7 ymin: 198946.3 xmax: 601149.4 ymax: 300165
## z_range:       zmin: 0 zmax: 0
## Projected CRS: OSGB 1936 / British National Grid
TQ_greenspace <- st_read(here::here('data/greenspace', 'TQ_GreenspaceSite.shp'))
## Reading layer `TQ_GreenspaceSite' from data source 
##   `/Users/Tycho/Dropbox/UCL/Web/jtvandijk.github.io/GEOG0114/data/greenspace/TQ_GreenspaceSite.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 19575 features and 6 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XYZ
## Bounding box:  xmin: 499419.4 ymin: 99770.75 xmax: 600740.5 ymax: 201001.5
## z_range:       zmin: 0 zmax: 0
## Projected CRS: OSGB 1936 / British National Grid

To see what each variable looks like, you can type in plot(name_of_variable) into the R console. This is a quick command to understand both the spatial coverage and attributes of your data - as it will display the data by each of its attribute fields as a plot.

1.6 Data Processing

Now we have our data loaded as variables, we’re ready to start processing! In spatial data processing, the question always is: where do I start first? And the easiest answer to that is: make sure all of your data is in the same Projected (or Coordinate) Reference System as each other. Checking - and changing projections - should always be the first step of any workflow as this will ensure you do not carry through any potential mistakes or errors that using the wrong system can cause.

1.6.1 Reprojecting

When you loaded your datasets in the above step, you may have notice that in the console additional information about the dataset is printed - this includes the metadata on the dataset’s Coordinate Referene System! As a result, it is quite easy to simply scroll the terminal to check the CRS for each dataset - which as you’ll see, all the datasets bar the school are using EPSG 27700, whih is the code for British National Grid, whereas our schools dataset shows 3857, the code for WGS84. That means we need to start with our london_schools variable - as we know that this is the only dataset currently in the wrong projection (WGS84) instead of using British National Grid.

To reproject our dataset, we can use a function within the sf library, known as st_transform(). It is very simple to use - you only need to provide the function with the dataset and the code for the new CRS you wish to use with the data. For now, we will simply store the result of this transformation as a new variable - but you could in the future, rewrite this code to use pipes to pipe this transformation when loading the dataset.

# reproject london schools form WGS84 to BNG 
london_schools_prj <- st_transform(london_schools, 27700)

We can now double-check our new variable is in the correct CRS by typing the following into the console and checking the result:

# check CRS 
st_crs(london_schools_prj)
## Coordinate Reference System:
##   User input: EPSG:27700 
##   wkt:
## PROJCRS["OSGB 1936 / British National Grid",
##     BASEGEOGCRS["OSGB 1936",
##         DATUM["OSGB 1936",
##             ELLIPSOID["Airy 1830",6377563.396,299.3249646,
##                 LENGTHUNIT["metre",1]]],
##         PRIMEM["Greenwich",0,
##             ANGLEUNIT["degree",0.0174532925199433]],
##         ID["EPSG",4277]],
##     CONVERSION["British National Grid",
##         METHOD["Transverse Mercator",
##             ID["EPSG",9807]],
##         PARAMETER["Latitude of natural origin",49,
##             ANGLEUNIT["degree",0.0174532925199433],
##             ID["EPSG",8801]],
##         PARAMETER["Longitude of natural origin",-2,
##             ANGLEUNIT["degree",0.0174532925199433],
##             ID["EPSG",8802]],
##         PARAMETER["Scale factor at natural origin",0.9996012717,
##             SCALEUNIT["unity",1],
##             ID["EPSG",8805]],
##         PARAMETER["False easting",400000,
##             LENGTHUNIT["metre",1],
##             ID["EPSG",8806]],
##         PARAMETER["False northing",-100000,
##             LENGTHUNIT["metre",1],
##             ID["EPSG",8807]]],
##     CS[Cartesian,2],
##         AXIS["(E)",east,
##             ORDER[1],
##             LENGTHUNIT["metre",1]],
##         AXIS["(N)",north,
##             ORDER[2],
##             LENGTHUNIT["metre",1]],
##     USAGE[
##         SCOPE["Engineering survey, topographic mapping."],
##         AREA["United Kingdom (UK) - offshore to boundary of UKCS within 49°45'N to 61°N and 9°W to 2°E; onshore Great Britain (England, Wales and Scotland). Isle of Man onshore."],
##         BBOX[49.75,-9,61.01,2.01]],
##     ID["EPSG",27700]]

As you can see from the output above, our dataset has been reprojected into EPSG 27700 or British National Grid!

The next step to process our london_schools_prj dataset is to reduce the schools to only our chosen London extent. As you can see from the map above, our schools cover an area larger than our usual London extent. We can even make a quick map of this to check this properly:

# inspect
tm_shape(london_wards) + 
  tm_polygons() + 
tm_shape(london_schools_prj) + 
  tm_dots()

As we can see, we indeed have schools outside of our London wards - as a result, we want to remove those schools outside of this boundary. We will do this by first dissolving our ward file to create a more simplified shapefile for use as a “cookie-cutter”.

1.6.2 Dissolving

To dissolve a polygon shapefile using R code, we will use the summarise() function that comes from the dplyr library (part of the tidyverse) and summarise our London Wards dataset by summing its total area (supplied in the HECTARES attribute field/column) across all records. This will reduce our data frame to a single row, which will only contain one attribute - our total area of London, which we can then map/use as our clip (cookie-cutter) feature!

# dissolve
london_outline <- london_wards %>% summarise(area = sum(HECTARES))

# inspect
tm_shape(london_outline) +
  tm_polygons()

1.6.3 Subsetting

Now we have out London outline, we can go ahead and clip our schools dataset by our London outline. Whilst there is a clip function within the sf library, what we will do here is use a techinque known as spatial subsetting, which is more similar to selecting by location: we will subset our london schools dataset by filtering out those that are not within the London Outline. This approach in R is much quicker than using the clip function - although deciding which approach to use is not only a question of speed but also how each function will affect the filtered data. When using a clip function, the function acts exactly like a cookie-cutter and will trim off any data that overlaps with the boundaries used. Conversely, when using a subsetting approach, if a data point or polygon overlaps on the boundary, it will still be included (depending on the topological relationship used) but in its entirety (i.e. no trimming!).

As we’re using point data, it is generally easier to use a subset approach. There are multiple ways to conduct spatial subsetting within R:

First, we can either use [] just like you would use for selecting and slicing a normal (table-based) dataframe from R’s base package - or we can use the filter() function from dplyr within the tidyverse. Second, sf has its own library of subsetting through geometric operations, including: intersection, difference, symmetrical difference and snap.

To keep things simple, we will use the base subsetting approach - which also works similarly when programming in Python, for instance.

# subset London schools
london_schools_prj_ss <- london_schools_prj[london_outline,]

Note
In a case like above, you can just overwrite the current london_schools_prj variable as you know it is the dataset you want to use. Much of this code could be condensed into several lines using pipes to make our code shorter and more efficient - but then it would be harder to explain! As you progress with R and programming, you are welcome to bring pipes and restructuring into own your code - but even if you don’t, as long as your code does what you need it to do, then that’s our main aim with this course!

Once you have run the above code, you should notice that your london_schools_prj_ss variable now only contains 3,372 records, instead of the original 3,889. We can also plot our variable using the same code as above, to double-check that it worked:

# inspect
tm_shape(london_wards) + 
  tm_polygons() + 
tm_shape(london_schools_prj_ss) + 
  tm_dots()

We should now see that our schools are all contained within our ward dataset, so we know this dataset is ready to be used for analysis. We will now explore which schools are within 400m of greenspace and which are not. But first, we now need to get our greenspace data ready so we can create the 400m buffers needed for this analysis.

1.6.4 Unioning

We’ve done a lot of processing so far to do with our schools and ward data, but now it’s time for the greenspace datasets. If you look back at your code, you should remember that we have two datasets for our greenspace in London, which we now need to join together. This type of join is typically known as a union - and this is the type of tool you would want to look for across any GUI system.

When it comes to programming, however, in either R or python, there is a much simpler way of joining datasets - and that’s simply copying over the records or observations from one variable into another - and the sf library has a ready-to-go function for us to use, known as rbind(). This function allows you to ‘bind’ rows from one sf variable to another, including copying the geometry column over:

# join greenspace datasets together
greenspace = rbind(TQ_greenspace, TL_greenspace)

1.6.5 Clipping

The next step is to clip our reduced greenspace data to our London outline. Within sf, the clip function is known as the st_intersection() function - not to be confused with st_intersects() from above! A clip will change the geometry of some of our greenspaces on the outskirts of London, i.e. cookie-cut them precisely to the London outline. If we used the subset approach approach as we did earlier with our point data, we would simply extract all greenspaces that intersect with the London outline - but not change their geometry.

What we can do however if reduce the processing required by our computer by using a mixture of these two methods - if we first subset our all_greenspace dataset by our London outline and then run the clip, our processing will be much faster:

# subset and clip
london_greenspace <- greenspace[london_outline,] %>% st_intersection(london_outline)
## Warning: attribute variables are assumed to be spatially constant throughout all
## geometries
# inspect
tm_shape(london_outline) + 
  tm_polygons() + 
tm_shape(london_greenspace) + 
  tm_polygons()

1.6.6 Attribute selection

Now we have only London greenspaces in our dataset, the next step, is to reduce the number of greenspaces to only those bigger than a football pitch. To do this, we will use another type of subsetting you’ve probably come across, which is attribute subsetting - by using a simple query to subset only records that have an area larger than 76,900 square feet or 7,140 square metres. To do this, we’ll use the filter() function from the dplyr library we mentioned earlier as well as another function called set_units() which is from the unit library that you’ve loaded - but we haven’t yet discussed. The set_units() function allows us to assign units to numerical values we are using within our query, i.e. here, for our query to run, our value must be in square metres to match the unit of the area_m column.

To be able to query on our area, we must first calculate the area of each of our greenspaces. To do so in R, we can use the st_area() function within sf, which will calculate the area of each of our records/observations in our greenspace dataset. To store the output of this function as a new column in our london_greenspace dataset, we use a simple notation at the end of our london_greenspace variable: $area_m. The $ in R means for this data frame, access the column that proceeds this sign. In our case, we do not as yet have a column called area_m, therefore R will automatically create this column and then store the outputs of the function in this column:

# calculate area
london_greenspace$area_m <- st_area(london_greenspace)

Once we have our area column, we can now filter our dataset based on that column and filter out all greenspace with an area that is smaller than 7,140 square meters (i.e. the size of a football pitch!).

# filter large greenspaces
large_london_greenspace <- london_greenspace %>% filter(area_m > set_units(76900.0, m^2))

We now can look at our final greenspace dataset against our london outline to see its final coverage:

# inspect
tm_shape(london_outline) + 
  tm_polygons() + 
tm_shape(large_london_greenspace) + 
  tm_polygons()

1.6.7 Buffering

We now have our London greenspace dataset - we are ready for the last step of processing with this dataset - generating our buffers that we can use to find all schools within 400 meters of the large greenspace areas. Once again, the sf library has a function for generating buffers - we just need to know how to deploy it successfully on our London greenspace dataset - and this involves understanding how to denote our distance correctly - as well as understanding if and how we can dissolve our buffer into a single record.

To do this, we would investigate the documentation of the function st_buffer() to find out what additional parameters it takes - and how. What we can find out is that we need to (of course!) provide a distance for our buffer - but whatever figure we supply, this will be interpreted within the units of the CRS we are using. In our case, we are using British National Grid and, luckily for us, the units of the CRS is metres - which makes are life significantly easier when calculating these buffers. For other CRS, many use a base unit of an Arc Degree, e.g. WGS84. In this case, you technically have two options: 1) reproject your data into a CRS that uses metres as its base unit OR 2) convert your distance into an Arc Degree measurement. Always choose Option 1.

Fortunately none of this is our concern - we know we can simply input the figure of 400 into our buffer and this will generate a buffer of 400m.

# greenspace buffer
gs_buffer_400m <- st_buffer(large_london_greenspace, dist=400)

As our final bit of processing with our greenspace buffer, we want to dissolve the whole buffer into a single record. To do this, we’ll replicate the code used for our London ward dissolve, creating a an area value for our buffer records in the process to be used within the summarisation - and then result in a new gs_buffer_400m_single variable:

# dissolve greenspace buffer
gs_buffer_400m_single <- gs_buffer_400m %>% summarise(area = sum(st_area(gs_buffer_400m)))

# inspect
tm_shape(london_outline) + 
  tm_polygons() + 
tm_shape(gs_buffer_400m_single) + 
  tm_polygons()

1.7 Greenspace in London

Great, we are now ready to bring our two datasets together reaady for anlaysis - and to do so, we’ll use subsetting as well as the st_intersects() function, although with this one, we’ll use it in two different ways!

Our first task is to identify those schools that have access to greenspace - and extract them to create a new variable for use within our final point-in-polygon count (i.e. how many schools within each ward has access to greenspace). As we know, we can subset our london_schools dataset by our greenspace buffer quite easily using the subset approach:

# schools within 400m of greenspace
london_schools_gs <- london_schools_prj_ss[gs_buffer_400m_single,]

Our london_schools_gs variable has been subsetted correctly if we end up with 1,477 records, instead of the 3,372 records we had previously. We can now use this dataset and our previous london_schools_prj_ss dataset to create counts at the ward level. But before we do that, we will create a binary attribute of greenspace access within our london_schools_prj_ss variable to visualise our school ‘points’. To do this, we’ll use the st_intersects() function mentioned above and add a new column, gs_access (i.e. greenspace access), which will tell us which schools have access to greenspace or not.

The st_intersects() function is really useful as its output is a simple TRUE or FALSE statement - does this record intersect with the greenspace buffer? This result is what will be stored in our new column as a TRUE or FALSE response and what we can use to map our schools and their greenspace access:

# greenspace access
london_schools_prj_ss$gs_access <- st_intersects(london_schools_prj_ss, gs_buffer_400m_single, sparse=FALSE)

We could go ahead and recode this to create a 1 or 0, or YES or NO after processing, but for now we’ll leave it as TRUE or FALSE. We can go head and now visualise our schools based on this column, to see if they have access (TRUE) or do not have access (FALSE) to greenspace. To do this, we’ll use the tmap library again:

# inspect
tm_shape(london_schools_prj_ss) + 
  tm_dots(col='gs_access', palette='BuGn')

You’ll be pleased to read that we are finally here - we are at the last stage of our processing and can finally create our rate of schools that have greenspace access, versus those that do not! To do this, we’ll be counting the number of points in each of our polygons, i.e. the number of schools in each ward.

To do so in R and with sf, it is one line of code - which at first look does not sound at all like it is completing a point-in-polygon calculation - but it does! To create a PIP count within sf, we use the st_intersects() function again - but instead of using the output of TRUE or FALSE, what we actually extract from our function is its lengths recording. The lengths part of this function records how many times a join feature (i.e. our schools) intersects with our main features (i.e. our wards). (Note here, we do not set the sparse function to FALSE but leave it as TRUE/its default by not entering the parameter). As a result, the length of this list is equal to the count of how many schools are within the polygon - i.e. a PIP calculation.

This is a really simple way of doing a PIP calculation - and makes it easy for us to store the output of the function and its lengths (and thus the count) directly as a column within our london_wards dataset, as so:

# total number of schools in each ward 
london_wards$total_schools <- lengths(st_intersects(london_wards, london_schools_prj_ss))

# total number of schools with greenspace access in each ward
london_wards$gs_schools <- lengths(st_intersects(london_wards, london_schools_gs))

As you can seee from the code above, we’ve now calculated this for our total schools dataset and the schools that have access to greenspace. The final step in our processing therefore is to create our rate. To do so, we’ll use the same approach of generating a new column within our london_wards dataset - and then use a mathematical formula to calculate our rates:

# percentage of schools with greenspace access
london_wards$gs_rate <- (london_wards$gs_schools/london_wards$total_schools)*100

And that’s it! We now have our greenspace rate for our wards, which we can now again map:

# inspect
tm_shape(london_wards) + 
  tm_polygons(col='gs_rate', palette='Greens')

We now have our final dataset ready for analysis. Right now, we haven’t introduced you to any statistical or spatial analysis techniques to fully analyse our dataset - but instead, we can focus on what are data shows visually!

The last step of any programming is to extract our variables into permanent datasets for use at a later time. You can at any point in this practical, extract a permanent data file for each of our variables. For now, we’ll extract our new london_wards dataset as we might want to use this in some additional analysis that we could look at next week or for our assessments at a later stage. The great thing about coding this up now, is that it will be easy to re-run all of this analysis and export any of the variables, again, at a later time!

# write
st_write(obj=london_wards, dsn='data/london_ward_gs.shp', delete_dsn=TRUE)

You should now see the dataset appear in your files!

1.8 Attributions

This week’s content and practical uses content and inspiration from: